1. INTRODUCTION

Every year, people are affected by diseases such as dengue fever, diarrhea, dysentery, and the
common cold. People with mild symptoms often do not consult a doctor or visit a hospital; instead,
they buy conventional medicines from pharmaceutical stores. As a result, pharmaceutical stores'
medication sales data directly indicate which medications people are taking. Research exists on
predicting disease outbreaks from surveillance data and weather data, but very few studies use
pharmaceutical stores' sales data for this purpose. Retail pharmaceutical stores sell different genres of
medicines produced by different pharmaceutical companies, yet medicines of the same genre are used for
the same common diseases, so sales of a genre correlate with the occurrence of those diseases.
Analyzing these correlations might make it possible to predict potential disease outbreaks in the near
future, which is the primary motivation for research in this area. Using statistical analysis, machine
learning, and deep learning with a neural network approach, we try to predict which genres of medicines
are likely to see increased sales soon and to estimate potential disease outbreaks.

Correlation is a statistical association between two random variables that can indicate a predictive
relationship and can be exploited in practice. Outbreak prediction estimates the epidemic potential
of diseases from the pattern of medication sales values. Long short-term memory (LSTM), a type of artificial
recurrent neural network, is well suited to classifying and forecasting time series data. Because retail
pharmacy sales data is day-to-day sales information, we have applied this method to make a predictive
analysis of generic pharmaceutical drug sales for 30 days and then to infer possible disease outbreaks
from that analysis. We have also made a comparative analysis of our predictions against the actual sales
in a month.
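
As a minimal illustration of the correlation described above, the following sketch computes the Pearson correlation coefficient between two series with NumPy. The function name `pearson_correlation` and the two example series are assumptions made for illustration; the numbers are invented and are not taken from the dataset used in this paper.

```python
import numpy as np


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denom = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return float((x_centered * y_centered).sum() / denom)


# Hypothetical daily counts, invented for illustration only.
antipyretic_sales = [20, 22, 25, 31, 40, 38, 45]
reported_fever_cases = [5, 6, 6, 9, 12, 11, 14]
r = pearson_correlation(antipyretic_sales, reported_fever_cases)
print(f"correlation between sales and cases: {r:.3f}")
```

A coefficient close to 1 would suggest that sales of a genre rise and fall together with the disease counts, which is the kind of relationship the approach relies on.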

Bangladesh and other South Asian countries lag comparatively behind in medical facilities. A
large number of people live below the poverty line and do not try to reach a doctor unless they are
severely sick or wounded. For this reason, medical institutions do not have proper information on the
sales of medicines or the diseases detected by doctors. Most people go to nearby pharmaceutical stores and
buy generic drugs that are used regularly for common diseases. Diseases like dysentery, cold symptoms,
diarrhea, fever, and gastrointestinal problems are taken very lightly in these regions and are often
treated with common medicines from pharmacy stores. This raises a strong possibility of predicting disease
outbreak potential more accurately from predicted medicine sales in these retail stores. Successful
prediction might become highly effective in taking precautionary measures against future outbreaks. For
instance, a significant increase in sales of a particular genre of medicines in a month might indicate a
good probability of an outbreak of the disease for which doctors would prescribe those medications. In
this way, we can get a broad view of the diseases that affect people every day, even if they are not
prescribed anything by a doctor. But sometimes people make wrong assumptions about their diseases and take
unnecessary or wrong medicines. As a result, these medicine sales values are noisy and very hard to
predict by statistical analysis alone. That is why we apply deep learning to this kind of dataset to make
a predictive analysis of potential disease outbreaks. Our system can be used in medical drug research, can
be financially beneficial to pharmaceutical institutions, and can be greatly useful for public health.

The main contribution of our work is a model based on LSTM and recurrent neural network (RNN)
architecture that predicts medicine sales, compares the predictions with the real sales values, and uses
this analysis to predict potential disease outbreaks at a certain time for a particular region. The rest
of the paper is organized as follows: related works are described in section 2; the proposed model,
dataset description, and comparative analysis are addressed in section 3; and results and discussion are
placed in section 4. Finally, the conclusion and limitations are provided in section 5.


2. LITERATURE REVIEW

A few studies have been conducted in the domain of illness or epidemic prediction.
Deepthi et al. [1] used patients' symptoms to forecast disease. The authors of [2] experimented with
revised estimate models over real-life healthcare data, using a latent factor model to reconstruct missing
data and so address the challenge of incomplete data. They also experimented with a regional chronic
disease related to brain infarction, and utilized several algorithms to analyze structured and
unstructured hospital data. Mohan et al. [3] used machine learning approaches to predict cardiac disease.
They suggested a novel strategy for identifying key features that improve the accuracy of cardiovascular
disease prediction. Their prediction model for heart disease, a hybrid random forest with a linear model
(HRFLM), attained an enhanced performance level with an accuracy of 88.7%.

Several methods have been proposed to predict specific disease outbreaks or increases in
medication dispensed. One applies the classical susceptible-infectious-removed (SIR) model, which uses
differential equations to represent disease dynamics, to predict the number of people who will be affected
by the influenza virus using pharmacy sales data [4]. The study also visualizes a comparison between the
results from surveillance data and pharmacy data. Another method uses an artificial neural network (ANN)
and a support vector machine (SVM) to predict malaria cases in a state using parameters such as humidity,
temperature, average monthly rainfall, the total number of Plasmodium falciparum (pF) cases, and the total
number of positive cases [5]. Dengue outbreak prediction studies have been conducted using similar
parameters [6]. Dengue fever prediction has also been studied using pharmacy sales data by applying a
Bayesian data analysis technique [7]. To predict outbreaks of gastrointestinal and respiratory illness, a
method has been proposed that applies an ANN to detect changes in the sales trends of over-the-counter
(OTC) pharmaceuticals [8]. Drug sales data has been shown to be effective in one study concerning these
kinds of illnesses [9]. Another work predicts price movement using a convolutional neural network (CNN)
and LSTM [10]. RNNs use their internal state (memory) to process variable-length sequences of inputs. As a
result, they are well suited to tasks such as unsegmented, connected handwriting recognition or speech
recognition [11]-[14]. In contrast to traditional feed-forward neural networks, LSTM has feedback
connections [15]. It can handle both single data points (such as photos) and complete data sequences (such
as speech, video, or time series). Kotecka et al. [16] examine the volumes of drugs sold in relation to
their presence in wastewater treatment plants. Smith et al. [17] evaluate the relationship between the
traditional patient acuity metric and medication regimen complexity. Wallis et al. [18] describe the
widespread use of self-prescribed medication among adults and the difficulties that arise from it. Reyana
and Kautish [19] suggest that machine learning (ML) tools are necessary to help hospital administrations
and frontline staff make efficient decisions about patient treatment and services.
Comito and Pizzuti [20] highlight the limitations of existing practices in learning from, and
interpreting, labeled data. Another study measures the effects of various non-pharmaceutical approaches to
combating the pandemic effectively [21]. Ye et al. [22] provide a method for forecasting the energy usage
of multiple jobs based on LSTM and multi-task learning techniques. Haag et al. [23] identified influential
pre- and post-COVID-19 factors relating to risk and resilience. Yoo et al. [24] study lymphocyte count and
albumin as laboratory parameters for correlating with disease severity.


3. METHOD 
3.1. Proposed model 

First, we built the dataset, which required some preprocessing before the proposed method could be
applied. Then, we applied an LSTM RNN for training and testing on the dataset. Next, we used the trained
model to make a comparative analysis of the predicted output and the real output. Finally, we predicted
disease outbreaks from a one-month forecast of future sales. The workflow diagram is depicted in Figure 1.


[Figure 1: flowchart of four steps: (1) data preparation and analysis, (2) apply LSTM RNN for dataset
training and testing, (3) make comparative analysis of predicted output and real output, (4) predict
disease outbreaks]

Figure 1. Workflow diagram


3.2. Dataset description 

We chose a dataset of daily sales of generic medicines in retail pharmacy stores, collected from
Kaggle [25]. The dataset is based on an initial collection of 600,000 transactional records exported
from individual pharmacies' point-of-sale systems over six years (2014-2019), showing the date and time of
sale, the pharmaceutical drug brand name, and the amount sold. A subset of 57 drugs from the dataset is
classified using the anatomical therapeutic chemical (ATC) classification system. There are no null values
or dummy data in this dataset.
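
To make the structure of such data concrete, the sketch below aggregates point-of-sale rows into daily totals per ATC group using only the Python standard library. It is a minimal sketch under stated assumptions: the tuple layout, the timestamp format, the brand names, and the `BRAND_TO_ATC` lookup are all hypothetical, and the published dataset may be organized differently.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical brand-to-ATC lookup; the brand names here are placeholders.
BRAND_TO_ATC = {"brand_a": "N02BE", "brand_b": "N02BE", "brand_c": "R03"}


def daily_sales_by_group(transactions, brand_to_atc):
    """Sum sold quantities per (date, ATC group) from point-of-sale rows.

    Each transaction is a (timestamp string, brand name, quantity) tuple.
    Brands without an ATC mapping are skipped.
    """
    totals = defaultdict(float)
    for timestamp, brand, quantity in transactions:
        group = brand_to_atc.get(brand)
        if group is None:
            continue
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date()
        totals[(day.isoformat(), group)] += quantity
    return dict(totals)


# Invented rows for illustration only.
transactions = [
    ("2019-09-01 09:15", "brand_a", 2),
    ("2019-09-01 17:40", "brand_b", 1),
    ("2019-09-01 11:05", "brand_c", 3),
    ("2019-09-02 10:30", "brand_a", 4),
    ("2019-09-02 12:00", "unmapped_brand", 5),
]
daily = daily_sales_by_group(transactions, BRAND_TO_ATC)
```

Grouping brands under a shared ATC code is what turns brand-level transactions into one daily series per genre.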


3.3. Dataset preparation and analysis 

The dataset contained data on different drugs produced by different companies, but several drugs
contained the same molecules and were prescribed for similar diseases. All of these similar drugs were
labeled under the same genre. A specific genre of medicine is generally prescribed for one or more
specific diseases. Here, the dataset has been classified into 8 groups of generic medicines according to
the ATC classification system. To obtain optimal performance, we used Scikit-learn's MinMaxScaler to scale
our data. LSTMs require the data in a specific format, a 3D array. We began by generating sequences of 60
timesteps and converting them to an array with NumPy. Following that, we created a 3D array with
dimensions of the number of X train samples, 60 timesteps, and one feature at each step. We now analyze
how the day-to-day sales of the 8 categories of medicine fluctuate, what these categories are, and what
kinds of diseases they are used for.

- M01AB- Figure 2(a) refers to medicines that comprise anti-inflammatory and antirheumatic
drugs, non-steroids, acetic acid derivatives, and related compounds. This type includes medications like
zomepirac (used to alleviate mild to severe pain), alclofenac (used to treat rheumatoid arthritis and
ankylosing spondylitis, and as an analgesic in severe arthritic diseases), and bufexamac (used to treat
atopic eczema and inflammatory dermatoses of the skin) (see Appendix).

- M01AE- Figure 2(b) shows anti-inflammatory and antirheumatic medicines, non-steroids, and
propionic acid derivatives. This class of medicine includes tarenflurbil and levopropoxyphene
(pharmaceuticals frequently prescribed for the treatment of cough and related respiratory tract
disorders), dextropropoxyphene (an analgesic enantiomer), and oxaprozin (for osteoarthritis, rheumatoid
arthritis, and juvenile rheumatoid arthritis) (see Appendix).

- N02BA- This category in Figure 2(c) indicates other analgesics and antipyretics, as well as salicylic acid
and derivatives. It includes salicylamide (for the alleviation of pain and discomfort caused by ordinary
mouth ulcers, cold sores, denture sore spots, and infant teething) and acetylsalicylic acid (used for
pain, fever, inflammation, and migraines, and to lower the risk of serious adverse cardiovascular events,
among other uses) (see Appendix).

- N02BE/B- Figure 2(d) refers to the genre of medicines that contain other analgesics and
antipyretics, pyrazolones, and anilides. A common medicine in this genre is propacetamol (used in
multimodal analgesia therapy to control perioperative fever and discomfort) (see Appendix).

- N05B- Figure 2(e) refers to the genre of medicines that contain psycholeptic and anxiolytic drugs.
These kinds of medicines are generally used for insomnia, anxiety, sleep disorders, and related conditions
(see Appendix).

- N05C- Figure 2(f) refers to the genre of medicines that contain psycholeptic, hypnotic, and sedative
drugs, which are generally used as sedatives and prescribed to treat insomnia (see Appendix).

- R03- Figure 2(g) indicates the medicines that contain drugs for obstructive airway diseases, such as
flunisolide (used as a prophylactic therapy in the maintenance treatment of asthma) and dyphylline (used
to treat asthma, bronchospasm, and COPD) (see Appendix).

- R06- This category in Figure 2(h) includes antihistamine-containing medicines for systemic use.
Terfenadine, dexbrompheniramine (also used for upper respiratory tract symptoms), and phenindamine are
antihistamines used to treat allergy symptoms such as sneezing, runny nose, itching, watery eyes, hives,
and rashes, as well as cold symptoms (see Appendix).

Figure 3 shows that the sales of these 8 medicine genres fluctuate highly irregularly, which makes
forecasting disease outbreaks from this dataset a challenging task. In the next step, we use a deep
learning model to predict sales.


[Figure 3: line plot of daily sales volume (y-axis) against date (x-axis, 1/1/2015 to 1/1/2019) for
M01AB, M01AE, N02BA, N02BE, N05B, N05C, R03, and R06]

Figure 3. Sales variations of 8 ATC categories
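
The scaling and windowing steps described in this section can be sketched as follows. This is a minimal sketch, not our exact preprocessing code: min-max scaling is written in plain NumPy instead of Scikit-learn's MinMaxScaler so that the example has no extra dependency, the function names `min_max_scale` and `make_windows` are assumptions, and the sales series is synthetic.

```python
import numpy as np


def min_max_scale(series):
    """Scale a 1D series to [0, 1]; returns (scaled, minimum, maximum)."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return (series - lo) / (hi - lo), lo, hi


def make_windows(scaled, timesteps=60):
    """Build (samples, timesteps, 1) inputs and next-day targets."""
    x, y = [], []
    for end in range(timesteps, len(scaled)):
        x.append(scaled[end - timesteps:end])  # the previous `timesteps` days
        y.append(scaled[end])                  # the day to be predicted
    x = np.array(x).reshape(-1, timesteps, 1)
    return x, np.array(y)


# Synthetic daily sales for one genre, for illustration only.
rng = np.random.default_rng(0)
daily_sales = rng.poisson(lam=30, size=200)

scaled, lo, hi = min_max_scale(daily_sales)
x_train, y_train = make_windows(scaled, timesteps=60)
print(x_train.shape, y_train.shape)
```

Each training sample holds 60 consecutive scaled values, and its target is the value on the following day, so a series of 200 days yields 140 samples.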


3.4. Model development 

We used a sequential model as the basis of our neural network. The LSTM layer was then added,
together with a fully connected (dense) neural network layer. To avoid overfitting, dropout layers were
also added. We set the dropout rate to 0.2, which means that 20% of the units are randomly dropped during
training. The dense layer, which produces an output of one unit, was then added. Our model used the Adam
optimizer, and the loss was determined using mean squared error. Finally, we trained the model with a
batch size of 32 for 100 epochs. We employed the RNN LSTM to anticipate sales volume for the following 30
days, and we compare the actual and predicted sales volume for September 2019.
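
A minimal sketch of such a model, assuming the Keras API bundled with TensorFlow, is given below. The number of LSTM units (`LSTM_UNITS = 50`) and the use of a single LSTM and dropout layer are assumptions, since the layer width and depth are not specified above. The helper `forecast_recursive` is also an assumption: it shows one common way to obtain a 30-day forecast from a one-step model, by feeding each prediction back into the input window.

```python
import numpy as np

# Settings stated in the text, plus LSTM_UNITS, which is an assumed value.
TIMESTEPS = 60
DROPOUT_RATE = 0.2
BATCH_SIZE = 32
EPOCHS = 100
LSTM_UNITS = 50  # assumption: the layer width is not reported


def build_model(timesteps=TIMESTEPS, units=LSTM_UNITS, dropout=DROPOUT_RATE):
    """Sequential LSTM -> Dropout -> Dense(1), compiled with Adam and MSE."""
    # Imported lazily so the forecasting helper below works without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Input(shape=(timesteps, 1)),
        keras.layers.LSTM(units),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


def forecast_recursive(predict_next, last_window, horizon=30):
    """Roll a one-step predictor forward, feeding each output back as input."""
    window = np.asarray(last_window, dtype=float).copy()
    forecast = []
    for _ in range(horizon):
        next_value = float(predict_next(window))
        forecast.append(next_value)
        window = np.append(window[1:], next_value)  # drop oldest, add newest
    return np.array(forecast)


def keras_predictor(model):
    """Wrap a trained Keras model as a one-step predictor over a 1D window."""
    def predict_next(window):
        batch = window.reshape(1, -1, 1)
        return model.predict(batch, verbose=0)[0, 0]
    return predict_next
```

With training arrays of shape (samples, 60, 1) and (samples,), training would look like `model = build_model()` followed by `model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS)`, and a 30-day forecast like `forecast_recursive(keras_predictor(model), last_60_scaled_values)`, where the array names are placeholders.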


3.5. Comparative analysis of the predicted and real output 

Different genres of medicines are sold every day in pharmacy stores. The visualization shows a
correlation among the generic medicine features as they change with time. The changes are regular and
symmetric. These genres are significantly correlated, and their random changes in sales volume affect
sales values collectively. A high probability of an increase in sales of any specific medicine genre might
indicate a possible outbreak of the diseases treated with it. The tools described above were applied to
the dataset after preprocessing, and the trained models were then tested for prediction analysis. We
trained our model using medication sales from 2014 to August 2019 and predicted the sales values through
September 2019, as shown in Figure 4. Then, we compared our predicted outcome with the real values and
visualized the comparison in Figures 4(a)-(h). In Figures 4(a)-(c) and Figure 4(f), the predicted volume
is the average sales quantity. Figures 4(d), 4(e), 4(g), and 4(h) show that the model tries to fit the
curves, but the actual data fluctuate so strongly that the real scenario cannot be predicted accurately.
Figure 5 provides the predicted medication sales from 11 October 2019 to 9 November 2019. Next, we predict
possible disease outbreaks from the previously observed results.
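
The comparison in this section is visual. As a hedged illustration of how such a comparison could also be quantified, the sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) after mapping scaled predictions back to sales units. These metrics are not reported in this paper; the function names and all numbers are assumptions used only for illustration.

```python
import numpy as np


def forecast_errors(actual, predicted):
    """Return MAE and RMSE between actual and predicted sales volumes."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    diff = predicted - actual
    return {
        "mae": float(np.abs(diff).mean()),
        "rmse": float(np.sqrt((diff ** 2).mean())),
    }


def inverse_min_max(scaled, lo, hi):
    """Map values scaled to [0, 1] back to the original sales units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


# Invented numbers for illustration only, not results from this paper.
actual_sales = [30.0, 42.0, 25.0, 38.0]
predicted_scaled = [0.5, 0.6, 0.5, 0.55]
predicted_sales = inverse_min_max(predicted_scaled, lo=10.0, hi=60.0)
errors = forecast_errors(actual_sales, predicted_sales)
print(errors)
```

A prediction that only tracks the average sales quantity would show a low MAE on stable genres but a large RMSE on genres with strong day-to-day fluctuation.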
Figure 4. Comparative analysis of predicted outcome vs real outcome for the (a) M01AB, (b) M01AE,
(c) N02BA, (d) N02BE, (e) N05B, (f) N05C, (g) R03, and (h) R06 genres

Figure 5. Sales prediction from 11 October 2019 to 9 November 2019


4. RESULTS AND DISCUSSION 

The analysis of the prediction for the month from 11 October 2019 to 9 November 2019 shows a
significant increase in volume for the ATC genre N02BE, shown in Figure 4(d). This means there is a high
probability of an increase in the diseases for which medicines from N02BE are often prescribed. This genre
comprises medicines that contain analgesics and antipyretics, pyrazolones, and anilides. A common medicine
in this genre is propacetamol (used in multimodal analgesia therapy to reduce fever and pain during the
perioperative period). Other medicines in the genre are also used to treat several types of fever and
related diseases. This observation leads to the conclusion that our predicted month has a high possibility
of an increase in several types of fever, for instance dengue, malaria, and chikungunya. Though different
medications are prescribed for different types of fever, people buy generic drugs from pharmacy stores as
assumed medication. So, our analysis indicates a possible outbreak of diseases that cause high fever
following October 2019. Besides this, there is another significant increase in the sales of the ATC genre
N05B, shown in Figure 4(e). This genre comprises medicines that contain psycholeptic and anxiolytic drugs.
These kinds of medicines are generally used for insomnia, anxiety, sleep disorders, and related
conditions. These are not severe diseases that could cause outbreaks, but the increase in these illnesses
is not negligible.
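
The reasoning above, in which a genre is singled out when its predicted sales rise significantly, can be sketched as a simple rule. This is only an illustration under stated assumptions: no numeric threshold is defined in this paper, so the 25% relative increase used below, the function name `flag_increases`, and all volumes are hypothetical.

```python
import numpy as np


def flag_increases(history, forecast, threshold=0.25):
    """Return genres whose mean forecast exceeds the recent mean by `threshold`.

    `history` and `forecast` map a genre code to a sequence of daily volumes.
    The result maps each flagged genre to its relative increase.
    """
    flagged = {}
    for genre, past in history.items():
        baseline = float(np.mean(past))
        if baseline <= 0:
            continue  # relative change is undefined without a positive baseline
        change = (float(np.mean(forecast[genre])) - baseline) / baseline
        if change >= threshold:
            flagged[genre] = change
    return flagged


# Invented volumes for illustration only, not values from the dataset.
history = {"N02BE": [30, 32, 28, 30], "R03": [5, 6, 5, 4], "N05B": [8, 9, 8, 7]}
forecast = {"N02BE": [40, 42, 44, 42], "R03": [5, 5, 6, 4], "N05B": [10, 11, 10, 9]}
alerts = flag_increases(history, forecast, threshold=0.25)
print(alerts)
```

A flagged genre would then be mapped to the diseases it is commonly used for, as done for N02BE and fever in the discussion above.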


5. CONCLUSION 

We expect our work to impact several fields of research and development in the future. Our
proposed method of using LSTM for disease outbreak prediction might inspire other researchers and
encourage them to contribute more to this particular field. We also expect our work to be efficient and
applicable to various fields. The limitation of our research is that the dataset is inadequate: to predict
disease outbreaks over a year or another specific period, a dataset covering only 6 years is not enough.
Dataset collection was a complicated task because financial sales information is confidential and local
retail pharmaceutical stores did not want to share it. We plan to extend our research in the future to
predict disease outbreaks over a longer duration. We also plan to work with more descriptive genres that
indicate specific disease outbreaks rather than a genre of medicines. Our proposed model can be trained on
specific genres of medicines to predict possible disease outbreaks.